
Add split-and-retry path to GpuProjectExec #14724

Open

thirtiseven wants to merge 5 commits into NVIDIA:main from thirtiseven:project_split_retry
Conversation

@thirtiseven (Collaborator)

Fixes #14191.

Description

GpuProjectExec currently retries OOM via withRetryNoSplit only — if a projection runs cuDF kernels with internal scratch allocations the pre-split estimator cannot see (regex, string-replace, etc.), an OOM during the projection cannot be recovered and fails the task.

This PR adds a split-and-retry path: for purely deterministic projections, GpuProjectExec now drives the projection through RmmRapidsRetryIterator.withRetry(splitSpillableInHalfByRows). On GPU OOM the input batch is halved by rows and the projection is re-run on each half; the resulting sub-batches are concatenated back via ConcatAndConsumeAll.buildNonEmptyBatchFromTypes to preserve the single-batch contract of projectAndCloseWithRetrySingleBatch.
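The control flow described above can be sketched independently of the plugin as a generic halve-on-OOM retry. This is an illustrative simulation only: the real code path goes through RmmRapidsRetryIterator with spillable batches, and every name here (SimOom, projectWithSplitRetry) is invented for the sketch.

```scala
// Simulated GPU OOM; the plugin uses GpuSplitAndRetryOOM from spark-rapids-jni.
final class SimOom extends RuntimeException("simulated GPU OOM")

object SplitRetrySketch {
  // Project `rows` with `f`. On a simulated OOM, split the input in half by
  // rows, recurse on each half, and concatenate the pieces back together —
  // mirroring withRetry(splitSpillableInHalfByRows) + buildNonEmptyBatchFromTypes.
  // A single-row batch that still OOMs cannot be split further, so the
  // exception propagates, as it would in the real path.
  def projectWithSplitRetry[A, B](rows: Vector[A])(f: Vector[A] => Vector[B]): Vector[B] =
    try f(rows)
    catch {
      case _: SimOom if rows.size > 1 =>
        val (left, right) = rows.splitAt(rows.size / 2)
        projectWithSplitRetry(left)(f) ++ projectWithSplitRetry(right)(f)
    }
}
```

With a projection that "OOMs" on batches larger than two rows, the sketch halves once and produces the same output as a single-batch run, matching the single-batch contract the PR preserves.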

Mixed deterministic + non-deterministic projections fall through to the existing withRetryNoSplit path: the non-deterministic side is computed once on the full input batch and stitched row-by-row to the deterministic side, and row-splitting either side would break that alignment. Both GpuProjectExec.projectWithRetrySingleBatch and GpuTieredProject.projectWithRetrySingleBatchInternal dispatch on the same condition (forall(_.deterministic) / areAllDeterministic).
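The dispatch condition both sites share reduces to a conjunction of the config gate and a determinism check over the bound expressions. A minimal sketch, with Expr and the field names as placeholders rather than the plugin's actual types:

```scala
// Placeholder expression type: real code checks Spark's Expression.deterministic.
final case class Expr(name: String, deterministic: Boolean)

// Split-retry is taken only when the gate is on AND every projection
// expression is deterministic; any non-deterministic expression (e.g. rand())
// routes the whole projection to the legacy withRetryNoSplit path.
def canSplitRetry(exprs: Seq[Expr], splitRetryEnabled: Boolean): Boolean =
  splitRetryEnabled && exprs.forall(_.deterministic)
```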

A new internal config spark.rapids.sql.projectExec.splitRetry.enabled (default true) gates the new path so it can be disabled to revert to the prior behavior if regressions surface.
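If the gate behaves like other spark.rapids.sql.* flags, it can presumably be flipped per session (a sketch, assuming an active SparkSession named `spark`; confirm the exact semantics in RapidsConf):

```scala
// Revert to the prior withRetryNoSplit-only behavior for this session.
spark.conf.set("spark.rapids.sql.projectExec.splitRetry.enabled", "false")
```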

Checklists

Documentation

  • Updated for new or modified user-facing features or behaviors
  • No user-facing change

The new config is internal() and the public behavior is unchanged in non-OOM cases.

Testing

  • Added or modified tests to cover new code paths
  • Covered by existing tests
  • Not required

tests/src/test/scala/com/nvidia/spark/rapids/ProjectSplitRetrySuite.scala adds five cases covering both dispatch sites:

  • split-retry on GpuProjectExec.projectAndCloseWithRetrySingleBatch with forceSplitAndRetryOOM produces output equal to a single-batch projection
  • with the conf disabled, the same injection propagates GpuSplitAndRetryOOM through the legacy path
  • the same coverage on GpuTieredProject.projectAndCloseWithRetrySingleBatch
  • a mixed deterministic + GpuRand projection routes through the legacy retry path (verified by comparing the rand column to a non-injected reference run)
  • a plain forceRetryOOM on the new path returns a single piece without splitting

Performance

  • Tests ran and results are added in the PR description
  • Issue filed with a link in the PR description
  • Not required

TBD — perf testing to follow.

Signed-off-by: Haoyang Li <haoyangl@nvidia.com>
@greptile-apps (Contributor, Bot) commented May 5, 2026

Greptile Summary

This PR adds a split-and-retry path to GpuProjectExec and GpuTieredProject for purely deterministic projections: on GPU OOM the input batch is halved by rows, each half is projected independently, and results are concatenated back via ConcatAndConsumeAll.buildNonEmptyBatchFromTypes. The sb.incRefCount() guard correctly compensates for the outer withResource(sb) close in both dispatch sites. One resource-lifecycle concern: in runWithSplitRetry, closeOnExcept(pieces) wraps the buildNonEmptyBatchFromTypes call, which already closes all input batches in its own finally block — so if Table.concatenate throws OOM, both paths fire close() on the same underlying GpuColumnVector objects.

Confidence Score: 4/5

Safe to merge after verifying the double-close of pieces batches in runWithSplitRetry when Table.concatenate throws; all other changes are logically correct and well-tested.

The incRefCount() guard at both call sites is correctly applied and the new split-retry logic is sound. The one issue is in runWithSplitRetry: closeOnExcept(pieces) still holds live references when buildNonEmptyBatchFromTypes is called; if the concat throws, that function's finally closes the batches and then closeOnExcept fires a second close on the same cuDF-native objects, driving ref-counts negative.

sql-plugin/src/main/scala/com/nvidia/spark/rapids/basicPhysicalOperators.scala — specifically the closeOnExcept(pieces) scope in runWithSplitRetry.

Important Files Changed

Filename Overview
sql-plugin/src/main/scala/com/nvidia/spark/rapids/basicPhysicalOperators.scala Adds split-and-retry path to GpuProjectExec and GpuTieredProject; ref-count handling for sb is correct via incRefCount(), but runWithSplitRetry has a double-close risk on the pieces ArrayBuffer when buildNonEmptyBatchFromTypes throws in the multi-batch concat path.
sql-plugin/src/main/scala/com/nvidia/spark/rapids/RapidsConf.scala Adds PROJECT_SPLIT_RETRY_ENABLED internal conf (default true) with clear doc string and lazy accessor; follows existing config conventions correctly.
tests/src/test/scala/com/nvidia/spark/rapids/ProjectSplitRetrySuite.scala Five focused unit tests covering split-retry, conf-disabled fallback, tiered projection, mixed deterministic+GpuRand routing, and plain GpuRetryOOM; afterEach resets retry counters properly.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[projectAndCloseWithRetrySingleBatch\nwithResource sb] --> B[projectWithRetrySingleBatch]
    B --> C{splitRetry.enabled\n&& all deterministic?}
    C -- yes --> D[sb.incRefCount\nrunWithSplitRetry]
    D --> E[withRetry sb\nsplitSpillableInHalfByRows]
    E --> F{OOM?}
    F -- split --> G[run projection\non each half]
    G --> H[collect pieces\nArrayBuffer]
    H --> I[buildNonEmptyBatchFromTypes\nconcat & close pieces]
    I --> J[return single ColumnarBatch]
    F -- no OOM --> K[run projection\non full batch]
    K --> J
    C -- no --> L[withRetryNoSplit path\nexisting logic]
    L --> J

Reviews (2): Last reviewed commit: "address coemments"

Comment thread on sql-plugin/src/main/scala/com/nvidia/spark/rapids/basicPhysicalOperators.scala (Outdated)
@thirtiseven thirtiseven self-assigned this May 5, 2026
Comment on lines +265 to +275

    val pieces = ArrayBuffer[ColumnarBatch]()
    closeOnExcept(pieces) { _ =>
      while (resultIter.hasNext) {
        pieces += resultIter.next()
      }
      val outputTypes = (0 until pieces.head.numCols()).map { i =>
        pieces.head.column(i).asInstanceOf[GpuColumnVector].dataType()
      }.toArray
      ConcatAndConsumeAll.buildNonEmptyBatchFromTypes(pieces.toArray, outputTypes)
    }
  }

P1 Double-close of pieces when buildNonEmptyBatchFromTypes throws in the multi-batch path. ConcatAndConsumeAll.buildNonEmptyBatchFromTypes closes all input batches in its finally block (arrayOfBatches.foreach(_.close())). If it throws (e.g., OOM during Table.concatenate), both that finally and the closeOnExcept(pieces) handler call close on the same GpuColumnVector objects, driving the cuDF native ref-count negative. Narrow the closeOnExcept scope to cover only the collection loop and let buildNonEmptyBatchFromTypes take exclusive ownership for the concat step.

Suggested change

Before:

    val pieces = ArrayBuffer[ColumnarBatch]()
    closeOnExcept(pieces) { _ =>
      while (resultIter.hasNext) {
        pieces += resultIter.next()
      }
      val outputTypes = (0 until pieces.head.numCols()).map { i =>
        pieces.head.column(i).asInstanceOf[GpuColumnVector].dataType()
      }.toArray
      ConcatAndConsumeAll.buildNonEmptyBatchFromTypes(pieces.toArray, outputTypes)
    }
  }

After:

    val pieces = ArrayBuffer[ColumnarBatch]()
    closeOnExcept(pieces) { _ =>
      while (resultIter.hasNext) {
        pieces += resultIter.next()
      }
    }
    val outputTypes = (0 until pieces.head.numCols()).map { i =>
      pieces.head.column(i).asInstanceOf[GpuColumnVector].dataType()
    }.toArray
    // Transfer ownership: buildNonEmptyBatchFromTypes closes the batches itself
    // (in its finally block). Do not wrap in closeOnExcept here to avoid a
    // double-close if Table.concatenate throws.
    ConcatAndConsumeAll.buildNonEmptyBatchFromTypes(pieces.toArray, outputTypes)
  }

Development

Successfully merging this pull request may close these issues:

[FEA] GpuProjectExec support for split-retry to handle OOMs.